Abstract

arXiv.org mediates contact with the literature for entire scholarly communities,
both through provision of archival access and through daily email and web
announcements of new materials, potentially many screenlengths long. We confirm and
extend a surprising correlation between article position in these initial announcements,
ordered by submission time, and later citation impact, due primarily to intentional 
"self-promotion" on the part of authors. A pure "visibility" effect was also present: 
the subset of articles accidentally in early positions fared measurably better in the 
long-term citation record than those lower down. Astrophysics articles announced in 
position 1, for example, overall received a median number of citations 83% higher, while 
those there accidentally had a 44% visibility boost. For two large subcommunities of 
theoretical high energy physics, hep-th and hep-ph articles announced in position 1 
had median numbers of citations 50% and 100% larger than for positions 5-15, and 
the subsets there accidentally had visibility boosts of 38% and 71%. 

We also consider the positional effects on early readership. The median numbers 
of early full text downloads for astro-ph, hep-th, and hep-ph articles announced in 
position 1 were 82%, 61%, and 58% higher than for lower positions, respectively, and 
those there accidentally had medians visibility-boosted by 53%, 44%, and 46%. Finally, 
we correlate a variety of readership features with long-term citations, using machine 
learning methods, thereby extending previous results on the predictive power of early 
readership in a broader context. We conclude with some observations on impact metrics 
and dangers of recommender mechanisms. 
1 Introduction



[Screenshot of the arXiv astro-ph "New submissions" page for Mon, 15 Dec 08 (submissions received from Thu 11 Dec 08 to Fri 12 Dec 08; 60 entries). Each entry shows the arXiv identifier, title, authors, comments, subject class, and full abstract.]



Figure 1: New astro-ph listings, from http://arXiv.org/list/astro-ph/new . Note that a 
standard sized Web or e-mail browser window may not accommodate even the full entries 
in the first two positions without requiring scrolling down. The astro-ph listings averaged 
roughly thirty such entries every weekday during the period studied here. 

The arXiv^1 repository currently contains over 500,000 documents and during calendar
year 2009 is growing at a rate of over 64,000 new submissions per year. For over a decade, it 
has been the primary means of access to the research literature in many fields of physics and 
in some related fields. Its log data provides the basis for many studies of user behavior during 
this unique transition period from print to electronic medium. The arXiv corpus is divided 
into different subject areas, with corresponding constituent subcommunities. Each of these 
subcommunities receives notifications each weekday of new articles received in the relevant
subject area, either by subscription to email announcements or by checking the web page 
of newly received submissions in the relevant subject area, updated daily (or, equivalently, 
through the associated RSS feed). These daily listings, viewed either through a web browser 
or email client, consist of standard metadata, including title and author information, as
well as the full abstracts. As depicted in fig. [1], this means that it is necessary to scroll down to
see beyond the entry in the second position, and to scroll down many times to see the entries 

^1 http://arXiv.org/. For a recent overview, see [Ginsparg, 2007].

in positions near the end of the daily announcements. While the overall order of articles is
retained when browsing through the archival monthly listings, no trace of the boundaries 
between days is retained, hence the daily positional information is lost, and of course articles 
retain no vestige of their position in original daily announcement when retrieved via the 
search interface. 

In what follows here, we investigate the effect of article position in these daily
announcements for certain physics subfields of arXiv, a purely short-term phenomenon, on citations
received over the long-term. This effect for the astro-ph subject area, primarily used by 
astrophysicists, was first considered in [Dietrich, 2008a,b]. Here we will consider as well two 
other communities of users, those of the hep-th and hep-ph subject areas ("High Energy 
Physics - Theory" and "High Energy Physics - Phenomenology"). The hep-th subject area 
is the original arXiv subject area initiated in mid 1991, covering highly theoretical areas of 
particle physics such as string theory. The hep-ph subject area was started in early 1992, 
covering areas of theoretical particle physics more directly related to experiment. During the 
2002-2004 periods to be studied here, hep-th and hep-ph received an average of roughly 3320 
and 4110 new submissions per year, respectively. The astro-ph area, started later in 1992, 
is an amalgam of many types of relevant theory and experiment, from stellar to galactic 
to cosmological, and by 2005 had grown to exceed the combined size of the High Energy 
Physics subject areas.^2 The astro-ph subject area averaged roughly 7720 new submissions
per year from 2002-2004, and grew to over 9000 new submissions per year in 2005-2006. 

A strong correlation between the position of articles in their initial announcement and 
the number of citations later received was found in [Dietrich, 2008a,b]. Since position in the 
daily announcement of newly received submissions is a one-day artifact, visible only that day 
and with no trace afterwards, it is extraordinarily surprising that it could nonetheless be 
correlated with long-term citation counts, accumulated years later. Due to the weight given 
to citations as a measure of research impact, it is important to verify such an unexpected 
effect by different methods, and assess whether some analog exists as well in other
communities. Our results here confirm the effect discovered in [Dietrich, 2008a,b], and suggest that
arXiv subject area organization and interface design should be reconsidered either to utilize 
or to counter such unintentional biases. 

It is evident to readers that a fraction of authors, working entirely within the established 
operating procedures for the site, has been jockeying for top position in the daily
announcements. Since late 2001, the policy has been that submissions received until 16:00 US eastern
time (EST/EDT) on a given weekday are announced at 20:00 eastern time, and submissions
received after that deadline are announced the following day, in rough^3 order of receipt.
Articles submitted shortly after 16:00 will thus be listed at or near the top of the next day's
announcement, and will potentially receive greater visibility. Submitters are evidently
conforming their schedules to take advantage of some presumed benefit to the greater visibility
afforded by submitting within this time window. 

Fig. [2] shows the submission counts, broken down by the time of submission, of arXiv:astro-ph
from the beginning of 2002 through the end of Mar 2007. The spike in submissions
corresponds to the period 16:00-16:10. That ten minute bin contains 5 times as many submissions

^2 http://arxiv.org/Stats/hcamonthly.html
^3 See sec. 2.1 for important exceptions.

Figure 2: Number of astro-ph submissions by time of day, in 10 minute bins, during the
period Jan 2002 - Mar 2007. 



as any other bin outside of the 16:00-16:30 period. Other variations during the day visible 
in the figure correlate with periodicity of overall activity levels, resulting from the effects of 
users in different timezones. (The period between 10 a.m. and noon eastern time, for example,
corresponds to late afternoon in western Europe and early morning in western U.S. The
server itself is not affected by any excessive operating load during the 16:00 period, since 
the submissions are automatically serialized by time of receipt. Typical submissions take 
under a second to process, and no noticeable processing queue develops from the [at most] 
few tens of submissions in that initial minute on a busy day, while the server simultaneously 
processes multiple retrievals and searches per second. The average submission rate during 
the rest of the day is roughly one new submission every six minutes.) 
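
The time-of-day binning behind fig. [2] can be sketched as follows. This is a minimal illustration, not the code used for the figure: the function names and the sample timestamps are hypothetical, and submission times are assumed to be already expressed in US eastern time.

```python
from collections import Counter
from datetime import time

BIN_MINUTES = 10  # bin width quoted in the caption of fig. [2]


def bin_start(t: time, width: int = BIN_MINUTES) -> time:
    """Return the start of the fixed-width time-of-day bin containing t."""
    minutes = t.hour * 60 + t.minute
    start = (minutes // width) * width
    return time(start // 60, start % 60)


def histogram(times, width: int = BIN_MINUTES) -> Counter:
    """Count submissions per time-of-day bin."""
    return Counter(bin_start(t, width) for t in times)


# Hypothetical submission times (eastern time), for illustration only.
sample = [time(16, 0, 5), time(16, 3), time(16, 9, 59), time(16, 10), time(11, 47)]
counts = histogram(sample)
```

With real log data, the bin starting at 16:00 would be the one carrying the spike described above.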

It is important to note that the positional effects are potentially much more dramatic 
than, say, the corresponding effects in presentation of search results. In the latter case, 
typically ten results are presented on a single web page, with each result entry reduced to 
a small number of lines of key text. Eye-tracking studies [Granka et al., 2004] have shown
the extent to which users nonetheless tend to focus only on the top few entries. In the 
case of arXiv announcements, on the other hand, the entries consist of entire abstracts (see 
fig. [1]). Only the first two entries are visible in a standard sized Web or e-mail browser 
window, and it is necessary to scroll down to see the remainder. The situation is thus more 
comparable to viewing successive pages of search results, where for example analysis of log 
data in [Fortunato et al., 2006] suggested a click probability that decreased with result rank
as r^{-1.6}.

In the sections below, we consider the positional effects on both citation and readership, in 
an attempt to understand author and reader behavior, and ascertain whether the policies of 
the arXiv system itself need modification to counter any unexpected long-term consequences 
of a seeming short-term artifact. 
2 Effects on Citation



2.1 Previous Work 

[Dietrich, 2008a] used the SPIRES High-Energy Physics Literature Database^4 to reconstruct
the daily arXiv astro-ph mailings from Jul 2002 through Dec 2005, giving the articles at least
a year to gather citations. The citations were collected from the SAO/NASA Astrophysics
Data System (ADS) bibliographic service^5 in December 2006. It was inferred that on
average, articles in positions 1 get 89.8 ± 9.0 citations, while those in positions 10-40 get
44.6 ± 0.9 citations. Three possible explanations were suggested for this: self-promotion bias
(SP), visibility bias (V), and geographic bias (G). The self-promotion argument assumes that
authors can intuit in advance the quality of their articles and specifically aim to promote the
better ones through early submission.^6 Enough of these higher quality articles are submitted
in the critical time window to result in the measured citation advantage for submissions in 
the first few positions. The visibility argument is that the initial higher visibility translates to 
higher readership, and some fraction of that higher readership translates to higher citations 
later on. The geographic argument is that articles submitted during the critical period are 
more likely to come from North America due to timezone differences, and those might be more 
likely to be cited for other reasons. Comparing overall citation trajectories of submissions
from Europe and North America, however, permitted exclusion of the geographic bias in
[Dietrich, 2008a], and it will not be considered further here.

Using submission times later provided from arXiv log data, a subsequent comparison of
three sets of articles was undertaken in [Dietrich, 2008b] to disentangle the SP and V biases.
The first set contained articles that appeared in the first three positions and were submitted 
within the first five minutes after the deadline, hence inferred to have been submitted with 
an intention to be listed at or near the top. The second set contained articles that were 
submitted after the first ninety minutes, and yet appeared in the first three positions.^7
These are assumed not to be self-promoted. The last set contained articles in positions 26-30.
It was observed that the self-promoted articles received more citations than those in the
other two sets. The articles that fortuitously appeared near the top, however, also appear to 
receive more citations than had they appeared in a lower position, indicating as well some 
visibility bias. The increase in citations due to the visibility bias was found to be smaller 
than that due to the self-promotion bias. 

The methodology used in [Dietrich, 2008a,b] to quantify the citation effects involves 
fitting the citation distributions to a power law, excluding the regimes of data that do not 
follow the power law (the head and the tail of the distributions), and averaging the rest. 



^4 http://www.slac.stanford.edu/spires/hep/
^5 http://adsabs.harvard.edu

^6 This is related in spirit to the 'self-selection' postulate [Kurtz et al., 2005b], which suggests that more
prestigious articles, i.e., those more likely to be cited, are more likely to be made freely accessible. In the 
current context, the suggestion is that those articles are as well promoted by authors to the top of a daily 
list of new freely accessible articles. 

^7 This can happen either because there were few or no early submissions, or because an administrative
removal of an early article caused a later submitted article to be shifted to that earlier position to fill the 
gap. 
Power law fitting can be tricky [Newman, 2005], and as described in Appendix [A], the above
methodology results in inadvertent biases, including using only a portion of the data. Due 
to the sociological importance of the result, it is useful to reconsider the results of [Dietrich, 
2008a,b] using slightly different methods. 



2.2 Methodology 

For heavy-tailed distributions such as power laws, the mean can be strongly affected by the 
large values at the tail. A more robust statistic is the median, which is not affected by the 
large values, and is also representative of the large number of small values in the sample set. 
For this reason, nonparametric statistical methods often use the median. More generally, we
can consider the k-th percentile as the aggregate measure of a set of values. If the quartiles
(25th and 75th percentiles, usually denoted Q1 and Q3, respectively) and the median of a
distribution are larger than the same quantities of another distribution (at a statistically
significant level), then stochastic dominance (see Appendix [B]) is likely. The interquartile
range (the difference Q3 - Q1) measures the spread of the distribution, analogous to the
variance of a normal distribution. We analyze the citation data by presenting plots of the
median and the quartiles and check for statistical significance, using the nonparametric
Mann-Whitney U (MWU, also known as the Wilcoxon rank-sum test) and Kolmogorov-Smirnov (KS) tests
[Gibbons, 1997].
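
As an illustration of the statistics named above, the following sketch computes the quartile summary and a Mann-Whitney U statistic using only the Python standard library. It is a simplified stand-in, not the procedure actually used for the results reported here: the function names are hypothetical, the quartile interpolation rule ("inclusive") is an arbitrary choice, and the p-value uses the large-sample normal approximation without a correction for ties.

```python
from math import sqrt
from statistics import NormalDist, quantiles


def summarize(values):
    """Return Q1, median, Q3 and the interquartile range of a sample."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"Q1": q1, "median": med, "Q3": q3, "IQR": q3 - q1}


def mann_whitney_u(x, y):
    """Return (U, p): U counts pairs with x_i > y_j (ties count 1/2);
    p is a two-sided p-value from the normal approximation (no tie correction)."""
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p
```

For samples as small as a few dozen articles, or with many tied citation counts, an exact or tie-corrected implementation from a statistics package would be the safer choice.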

We consider the 23,165 arXiv astro-ph articles from the beginning of 2002 through the end 
of 2004, announced in 777 daily announcements (via one-time email announcements and web 
pages daily updated), with an average mailing containing 29.8 papers. The citations were 
collected from NASA's Astrophysics Data System (ADS) Bibliographic Services in August 
2008, giving the articles over three and a half years to gather citations. There are thus 777
articles in each of the top positions (and roughly that number in the rest of the positions, at
least up to the typical number per announcement). 

Fig. [3] shows the median citations and quartiles for each position. The later positions are 
binned to reduce noise. From position 1, the median decreases until position 5, and beyond 
position 7 the medians effectively cease changing. The upper quartile (upper boundary of 
boxes) shows a more pronounced decreasing trend. Even the lower quartiles (lower boundary 
of boxes) show a decreasing trend. Statistical significance of these differences is assessed in 
Appendix [B].



2.3 Self-Promotion vs. Visibility 

We now consider the SP and V contributions to increased citations, taking a different
approach from that of [Dietrich, 2008b], as described in sec. 2.1.

In the astro-ph dataset, we mark those articles submitted in the first 10 minutes after
the deadline as "early" (E), a time period chosen from fig. [2]. Of the 23,165 articles, 1049
were marked as E, and the vast majority of those are likely to be self-promoted. The articles 
submitted after the first 30 minutes after deadline were marked "not early" (NE). The 
submitters of these are inferred to be indifferent about the position in the announcements. 
643 articles submitted after the first 10 minutes but before the first 30 minutes after deadline 
Figure 3: Box plot of citations for different positions in astro-ph. Boxes represent the
interquartile range, bounded above and below by the third and first quartiles, and the red 
horizontal lines mark the medians. 



were considered as ambiguous in author intent, so omitted from the analysis (which biases 
the results in neither direction). 
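
The labelling rule just described can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, the interval boundaries are treated as half-open (an assumption, since the text does not say how a submission at exactly 10 or 30 minutes is handled), and weekends and holidays are ignored.

```python
from datetime import time

DEADLINE = time(16, 0)   # weekday cutoff, US eastern time
EARLY_WINDOW = 10        # minutes after the deadline counted as "early" (E)
AMBIGUOUS_UNTIL = 30     # minutes after the deadline where the omitted window ends


def minutes_after_deadline(t: time) -> float:
    """Minutes elapsed since the most recent 16:00, wrapping around midnight."""
    secs = (t.hour * 3600 + t.minute * 60 + t.second) - DEADLINE.hour * 3600
    return (secs % 86400) / 60


def classify(t: time) -> str:
    """Label a submission time as 'E', 'ambiguous' (omitted), or 'NE'."""
    m = minutes_after_deadline(t)
    if m < EARLY_WINDOW:
        return "E"
    if m < AMBIGUOUS_UNTIL:
        return "ambiguous"
    return "NE"
```

For example, `classify(time(16, 5))` gives `"E"`, while a submission at 15:59 is nearly a full day past the previous deadline and is labelled `"NE"`.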

The median citation of the E articles is 20 while that of the NE articles is 9, and the 
difference in medians is significant using the MWU test at 1% significance level. The KS test 
at 1% significance level shows that the E citation distribution is as well higher than the NE 
distribution, in the global sense described in Appendix [B]. The rank-frequency (RF) plots
of the two citation distributions, depicted in fig. [4], indicate that self-promoting submitters
by-and-large do have a good intuition for the likely future impact of their articles. Not 
all self-promoted articles, however, receive high citations: roughly 10% of the E articles in 
position 1 have no more than 1 citation. 
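
The quantities behind a rank-frequency plot and a two-sample KS comparison can be sketched in a few lines. This is an illustrative sketch with hypothetical function names, not the analysis code: it computes only the KS statistic (the maximum gap between empirical distribution functions), not its significance level, and the `dominates` helper checks first-order stochastic dominance on the samples themselves.

```python
def rank_frequency(citations):
    """Return (rank, citations) pairs, with rank 1 for the most cited article."""
    return list(enumerate(sorted(citations, reverse=True), start=1))


def ecdf(sample, x):
    """Empirical cumulative distribution function of sample, evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ks_statistic(a, b):
    """Two-sample KS statistic: largest vertical distance between the two ECDFs."""
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


def dominates(a, b):
    """True if sample a is stochastically at least as large as b everywhere,
    i.e. the ECDF of a never lies above the ECDF of b."""
    return all(ecdf(a, x) <= ecdf(b, x) for x in set(a) | set(b))
```

Plotting the output of `rank_frequency` on logarithmic axes for the E and NE samples would give curves of the kind shown in the RF plot.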



Position      1     2     3     4     5
Early       510   289   146    64    24
Not Early   147   299   484   613   694



Table 1: Number of E and NE articles in arXiv:astro-ph listed at positions 1-5 during the 
2002-2004 timeframe. 



To further probe the two biases, we separate out the E articles at each position. For the 
top five positions, the numbers of articles are shown in table [1]. Fig. [5] shows the median number
of citations for each position. The red bars (E articles) characterize the self-promotion
effect, while the green bars (NE articles) characterize the visibility effect. At every position, 
we see that the effect of self-promotion is much stronger than that of visibility, a difference 
significant at 1% level (MWU test) for the first 4 positions. The citation advantage of the 
Figure 4: Rank-Frequency (RF) plot for astro-ph citations. The solid line is for articles
submitted within the first 10 minutes after the weekday deadline of 16:00 eastern time. The 
dashed line is for the articles submitted after the first 30 minutes. 
Figure 5: Median citations for each position for astro-ph announcements from the beginning
of 2002 through the end of 2004. The red bars represent the 'self-promoted' articles.
The non-self-promoted articles in the top few positions, represented by the green bars, 
nonetheless receive more median citations than those lower down in announcements. 



top few positions is thus largely due to self-promotion, but as we shall see there is as well a 
visibility effect. 
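
Separating the E and NE articles at each position amounts to a grouped median. A minimal sketch follows, assuming each article is represented as a hypothetical `(position, label, citations)` tuple; the sample records are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import median


def median_by_position(records):
    """Median citations for each (position, label) group.

    records: iterable of (position, label, citations) tuples,
    where label is 'E' or 'NE'.
    """
    groups = defaultdict(list)
    for position, label, citations in records:
        groups[(position, label)].append(citations)
    return {key: median(values) for key, values in groups.items()}


# Invented records, for illustration only.
records = [
    (1, "E", 25), (1, "E", 15),
    (1, "NE", 12), (1, "NE", 10), (1, "NE", 14),
    (2, "E", 18), (2, "NE", 9),
]
medians = median_by_position(records)
```

Each entry of `medians` corresponds to one bar of a plot like fig. [5], with the `"E"` groups playing the role of the red bars and the `"NE"` groups the green bars.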

The difference between the blue bars in positions 1 and 5 in fig. [5] is statistically significant
(MWU test at 5% level), and while it is likely that this difference is entirely due to
the SP effect, there is not enough data in the green bars to make a statistically significant
statement (at the same 5% level). But we can compare these articles to ones that appeared 
in lower positions. The median citation of articles in positions 10-40 is 9, while the median 
citation of non-SP articles in positions 1-3, i.e., submitted after the first 30 minutes, is 12. 
This difference of 3 citations (significant at the 1% level) is the extent to which visibility 
bias contributes to citations. The non-SP articles are randomly selected, independent of 
authorship, length, subject area within Astrophysics, or other confounding quality factors, yet
solely by virtue of having appeared near the top of a web page or email announcement on one
single day, are measured to receive significantly more median citations many years later.^8

We have also analyzed the data for the full period in fig. [2], i.e., the 43,686 astro-ph articles
from the beginning of 2002 through the end of March 2007, announced in 1350 mailings. With
citations again collected from ADS in August 2008, this gave at least roughly a year and a half
for the most recent articles to gather citations. The resulting graph has the same general 
form as fig. with greater significance and medians only 10% to 20% smaller. But the 
"median number of citations" for the enlarged dataset doesn't correspond to any particular 
set of articles, because it involves an average over articles of vastly different ages, with as 
much as six and a half years to as little as a year and a half to collect citations. For this 
reason we used the smaller data set for which the medians do correspond to median numbers 
of citations for 4.5-6.5 years old, and don't change appreciably when we further restrict the 
time window of articles considered. The early timeframe was chosen for stable citation data, 
although the SP effect became increasingly pronounced in the later data. 

2.4 hep-th and hep-ph 

Having confirmed the self-promotion phenomenon in the astro-ph subject area, we now 
consider the hep-th and hep-ph subject areas: the largest and most active of arXiv's high 
energy physics areas. The 776 daily announcements for those areas during the Jan 2002-Dec 
2004 period had averages of 12.8 and 15.9 articles, respectively. 

Figs. [6]a,b show the number of hep-th and hep-ph submissions from the beginning of
2002 through Mar 2007, in 20 minute submission bins. The first 20 minutes after 16:00 
eastern time have exceptionally high submission rates, although not as high as astro-ph 
(fig. [2]). Articles submitted in this 20 minute period are considered early (E) and the rest are 
considered not early (NE). We use the articles submitted from Jan 2002 through Dec 2004 
for our analysis, for reasons discussed at the end of the previous subsection. Of the 9,932 
total hep-th submissions during this period, 309 were submitted during the first 20 minutes 
and marked as E; and of the corresponding 12,281 hep-ph articles, 363 are marked as E, 
a similar percentage as for hep-th. Citations were collected from the SPIRES High-Energy 
Physics Literature Database in September 2008, giving the articles over three and a half years
to accumulate citations. The high energy physics literature, like the astrophysics literature, 
is served by a relatively small number of conventional published journals, and dominated by 

^8 For comparison with the bins used by [Dietrich, 2008a,b], articles announced in astro-ph positions 1-6
received a median of 14 citations, 55% higher than the median of 9 for those in positions 10-40. The
NE articles in positions 1-6 received a median of 11 citations, pure visibility still giving 22% more median 
citations than those lower down. 
a very small number of very large ones.




Figure 7: Rank-Frequency (RF) plot for (a) hep-th and (b) hep-ph citations. The solid 
lines represent E articles, submitted within the first 20 minutes after the 16:00 eastern
time weekday deadline. The dashed lines are for the remaining articles. 



The early hep-th and hep-ph articles are interpreted as self-promoted and, as seen in 
figs. [7]a,b, their citation distribution stochastically dominates the rest (KS test at 1% significance
level). The median citation for hep-th position 1 is 12, while the median citation for
positions 4-15 is a significantly lower 8. Similarly for hep-ph, articles at position 1 have a 
median citation of 14, while articles at positions 4-15 have a median citation of 7. 
Figure 8: Box plot of citations for different positions in (a) hep-th and (b) hep-ph. Boxes
depict the interquartile range, and the red lines mark the medians. 



hep-th:
Position      1     2     3
Early       237    58    11
Not early   537   715   759

hep-ph:
Position      1     2     3
Early       282    67    12
Not early   492   703   756



Table 2: Number of articles in arXiv:hep-th and arXiv:hep-ph listed at positions 1-3 during 
the 2002-2004 timeframe. 



Fig. [8] shows the medians and the quartiles of different hep-th and hep-ph positions.
The first two positions have median number of citations significantly higher (at the 1% 
level) than the lower positions, and the difference between positions 1 and 2 is particularly 
striking. Fig. [9] disentangles self-promotion and visibility effects and, as in astro-ph, the 
self-promotion effect (red bars) dominates over the visibility effect (green bars), significant 
(1% level) for the first 2 positions. The effect is quite striking for the first position. The 
number of articles at each position is shown in table [2]. Note that since there were only 11
and 12 early articles at position 3, respectively, for hep-th and hep-ph, the red bars for this
case in Figs. [9]a,b are not statistically significant (and similarly for positions 4 and beyond).

Visibility 

Although self-promotion is the dominant effect in the positional citation advantage in each 
of astro-ph, hep-th and hep-ph (figs. [5] and [9]a,b), there was a pure visibility effect in the astro-ph
data and here we find it as well in the hep-th and hep-ph data. For hep-th, the articles in
position 1, but not early (green bar in fig. [9]a), have a median of 11 citations. Articles in
positions 5-10 have a median of 8 citations. This difference is significant at the 1% level.
Similarly, for hep-ph, the articles in position 1, but not early (green bar in fig. [9]b), have
a significant median citation advantage of 12 - 7 = 5 citations over the articles in positions

Figure 9: Median citations for each position in (a) hep-th and (b) hep-ph, for announcements
from the beginning of 2002 through the end of 2004. The red bars represent the
'self-promoted' articles.

5-10. The falling trends in the green bars in figs. [9]a,b capture the beginning of this visibility
effect. 
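
The percentage boosts quoted in the abstract follow directly from the medians given in this subsection. As a worked check (the function name is hypothetical; the input medians are those stated above for hep-th and hep-ph):

```python
def percent_boost(top_median, reference_median):
    """Relative increase of top_median over reference_median, in percent."""
    return 100 * (top_median - reference_median) / reference_median


# All position-1 articles versus lower positions (medians 12 vs 8 and 14 vs 7).
overall = {"hep-th": percent_boost(12, 8), "hep-ph": percent_boost(14, 7)}

# Not-early position-1 articles versus positions 5-10 (medians 11 vs 8 and 12 vs 7).
visibility = {"hep-th": percent_boost(11, 8), "hep-ph": percent_boost(12, 7)}
```

This gives overall boosts of 50% and 100%, and visibility boosts of 37.5% and about 71.4%, consistent with the 38% and 71% quoted in the abstract after rounding.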



2.5 Discussion 

It is not within the purview of this article to attempt a detailed explanation of why a one- 
time visibility would leave its trace in the citation record years later. As we shall see in 
the readership data in the next section, articles in the top few positions receive more initial 
downloads, whether or not submitted early (i.e., self-promoted). The extra initial readership 
may probabilistically translate into a few early citations, which in turn could cascade into 
more citations later on. We could hope to model this in terms of some set of "fungible" 
articles, more or less similar in quality and subject area, with the ones cited determined 
by something of a social convention, based on artifactual collective effects within the citing 
community. This would parallel the behavior seen in studies of how social influence affects
individual decisions and collective outcome in social networks [Salganik et al., 2006].



Citation practices differ from discipline to discipline, and there are many known
pitfalls of citation as a measure of quality. Studies of subsets of geoscience [Stewart, 1983],
astrophysics [Baldi, 1998], and demography [van Dalen & Henkins, 2001] do at least suggest
that citations primarily indicate some form of direct intellectual acknowledgement and
information flow, rather than primarily reflecting reputational or other secondary social factors.

But other features are known to be correlated to increased citation, including number
of authors,^9 number of pages, and also specifically visibility factors such as mainstream



^9 Larger groups could be correlated with more funding and hence better equipment and past track record;

media coverage or being featured on a journal front cover. For example, it was shown
in [Phillips et al., 1991] that major media coverage alone could lead to increased citations.
Control for other factors in that study was provided by a serendipitous period for which 
there is a newspaper archive of stories that would have appeared, but were not disseminated 
due to a distribution strike: the journal research articles that would have been featured in 
those stories do not exhibit the same citation boost as did articles covered during periods 
of normal newspaper distribution. A similar effect can now be expected from visibility in 
blogspace, or via publicity in either blogspace or the media and amplified through feedback 
loops between them. The analyses of [Dietrich, 2008b| and this section were similarly able 
to isolate the role of visibility by exploiting the serendipity of randomly selected articles 
accidentally accorded high visibility without the conscious intent of the authors. 

Since a significant component of the citation effect is nonetheless due to intentional
self-promotion, it is natural to wonder whether other forms of additional care taken during the
submission process also correlate with early submission, and hence with more citations in
the long term. For example, it is optional for authors to provide their institutional affiliations 
parenthetically along with their names in the Author field. We find that 63% of the early 
astro-ph submitters provided affiliations, compared to only 43% of the not early ones. The 
total length of the metadata fields in arXiv has always been limited to prevent any one 
submission from monopolizing too much screen space. (Submissions exceeding the limit are 
automatically rejected until they are within the limit, just as in this journal's submission 
process.) But early submitters nonetheless took maximal advantage within the guidelines: 
the median length of the title for early submissions was 70 characters, compared to 66 for 
not early ones (a difference significant at the 1% level by the KS test), and the median length of the abstract
was also greater for the earlier submissions, 1177 compared to 1014 characters (i.e., 16 lines 
compared to 14 in the email announcements, with lines wrapped at the nearest whitespace 
to under 80 characters per line). 

By contrast, early and not early submissions had the same median number of authors 
(three), the same likelihood of providing initials rather than full first names of authors, 
and (reassuringly) there was no tendency for authors of early submissions to have longer 
last names, so the increased length of the overall author field (median of 70 characters 
compared to 62) was due entirely to the increased tendency of early submitters to provide 
author affiliations. The greater completeness of metadata and inferred submitter effort also 
correlates with greater citation impact even among only the non self-promoted articles: for 
not early submission with author affiliations provided, the median number of citations in the 
2002-2004 astro-ph dataset was 10 compared to 9 for those without, a statistically significant 
difference (1% MWU). (For the early submissions, the median number of citations was also 
greater for the submissions that provided affiliation, 20 compared to 19, but the difference 
was not statistically significant in that case.) 
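The significance statements in this section rest on the Mann-Whitney U (MWU) test. As a minimal sketch of how such a comparison of two citation samples could be run, the following pure-Python functions compute the U statistic and a two-sided p-value from the normal approximation (with tie correction, without continuity correction). The function names and the toy citation counts are illustrative assumptions, not the code or data used for the study.

```python
import math

def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, approximate two-sided p-value)."""
    n1, n2 = len(a), len(b)
    pooled = list(a) + list(b)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical citation counts for two groups of articles.
with_affiliation = [12, 10, 9, 15, 11, 30, 8]
without_affiliation = [9, 7, 10, 6, 8, 12, 5]
u_stat, p_value = mann_whitney_u(with_affiliation, without_affiliation)
```

The normal approximation is only reasonable for samples that are not too small; for the thousands of articles per group considered here that condition would be comfortably met.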

The considerations in this section are also in principle independent of the 'citation
advantage' sometimes postulated for open access articles, since all of the articles in arXiv are
equally open access. But if the existence of this one-time visibility effect suggests the
possibility of an open access advantage, then any analog of the self-promotion effect (i.e., that
articles more likely to be cited are a priori more likely to be deposited in an open access
site) would have to be eliminated as the underlying cause.

The latter self-selection effect [Kurtz et al., 2005b] was considered further as a 'quality
bias' in [Moed, 2007] and [Davis & Fromerth, 2007], which studied respectively the citation
impact of those articles in the Condensed Matter (cond-mat) and Mathematics (math) sections
of arXiv later published in journals, as compared to articles in the same journals but
not deposited in arXiv. Both 'early view' (advance availability on arXiv prior to publication
in journal) and 'quality bias' (higher quality articles more likely to be posted on arXiv) are
potential confounding effects that could lead to an artifactual citation advantage, and it was
found that correcting for those left no general 'open access advantage' for articles deposited
in arXiv. Similarly, in a study of open access articles published in eleven scientific journals,
[Davis et al., 2008] used a randomized controlled trial to eliminate biases from other quality
indicators: whether self-archived, featured on the front cover of the journal, received a press release, and
other confounding attributes (nature of article, number of authors and geographic location,
number of references, article length, journal impact factor), and later estimated their effect.
This study as well found that any citation differences were due to factors other than open
access per se: while those articles randomly assigned open access status received more full
text downloads, they were no more likely to be cited a year later.

See also sec. 3.4.

As was indeed wondered by an anonymous referee.

In Jan 2009, the astro-ph section of arXiv was subdivided into six smaller subsections. 
It remains possible to receive the combined daily listings for all subsections, but many users 
expressed a preference to be able to browse only the restricted subsets. This division into 
smaller announcements will in principle ameliorate some of the positional effects, but not 
all, since the larger of these subsections still average more than ten new submissions per 
day. Some users have suggested randomizing the daily order entirely, either uniformly for 
everyone, or individually for each user. Others have pointed out that such a methodology 
would potentially do a disservice to readers, who may indeed be benefitting from having 
self-promoted articles brought preferentially to their attention (presuming those really are 
the more likely to be of importance in the long-run). Perhaps a better methodology is 
afforded by personalization, by which users can register to receive daily announcements based 
on their preferences, and ordered accordingly. These preferences can be indicated via a
controlled vocabulary of keywords, or via arbitrary search terms, and can be implemented 
in combination with data from a user's own past on-line reading behavior at the site, on an 
opt-in basis. 



3 Readership Data 

Since citations can signify some long-term reflection of quality (positive or negative), it 
is reassuring that the positional advantage of citation is primarily due to self-promotion, 
rather than to a one-time visibility effect. In this section, we consider the visibility effect on 
readership, and more generally consider how readership features can be used to predict the 
number of citations of an article. We will use full-text downloads as a proxy for readership.
The download data is from the main arXiv site only, though it constitutes a representative
sample. It is cleaned of robotic accesses and multiple repeat accesses from the same domain
within a small timeframe. Many articles are made available at the arXiv site in advance of
publication in a peer-reviewed journal, though some authors await the results of peer review
and make them available at arXiv.org more or less simultaneously with their appearance in
a conventional journal.

Such a personalization system has been available to the subset of readers using the myADS features of
the NASA ADS system, at http://myads.harvard.edu/ .



3.1 Previous Work 

Past studies have explored the relationship between downloads and citations. Using ADS
data from 7.66 months of 2001, including more than 1.8 million "reads", [Kurtz et al., 2005a]
studied, among other things, the mean relation between reads and cites, and estimated
roughly twenty ADS reads per citation for that period. [Perneger, 2004] investigated the
relationship between citations and first week's downloads for 153 articles in the British Medical
Journal (vol. 318 from 1999), and found that the first week's download activity appeared to
capture subsequent article citability. [Moed, 2005] computed the correlation between
downloads and citations using a larger sample from the journal Tetrahedron Letters: 1,190 short
articles published during the first half of 2001, with about 410,000 total downloads and
4,300 total citations. [Brody et al., 2006] discussed the correlation between early downloads
(minus the first seven days) and citations of arXiv articles deposited 2000-2002. The data
in this case came only from a single arXiv mirror, since the more voluminous data from the
main site was not publicly available. [Neurosci Editor, 2008] considered the relation between
early downloads (first 90 days) and future citations for a few hundred articles that appeared
in Nature Neuroscience during the period Feb-Dec 2005, and found a usefully predictive
correlation, despite a comparatively small level of download activity.

In what follows here, we use a data set considerably larger than the data used in those 
studies, and moreover a different methodology. Downloads and citations typically follow
heavy-tailed rather than normal distributions, so measures such as mean and standard deviation
are less useful. Instead of computing a simple correlation between two variables, we consider
the problem as a prediction task and use modern machine learning tools. Finally, we focus 
on the positional effect on readership, an effect not considered at all in the above, although 
any general relation between readership and citation, combined with a positional effect on 
citations investigated in the previous section, would naturally imply a positional effect on 
readership. 



3.2 General Pattern 

We use the readership for articles in the astro-ph, hep-th and hep-ph subject areas of arXiv
received from Jan 2002 through Mar 2007. The dataset contains the date and time of every
full-text download for each article through the end of 2007. There is great variation in the
temporal readership pattern of articles, but the general feature is a burst of initial readership
during an "active" period, and only sparse readership thereafter. The existence of such
an "active" period is an indication of the extent to which readers track the research via
the daily announcements of new submissions. In fig. 10 we see that almost all articles are
downloaded at least 10 times on the day they are first made public, and that fraction then
falls rapidly. For astro-ph, less than 1% of the articles have 10 or more downloads per day
after the first 10 days. We take 1% to be the threshold of activity, so the active period for
astro-ph is taken to be roughly 10 days. For hep-th, this period is 25 days, while for hep-ph
it is 15 days. The total number of downloads in the active period can be taken as a measure
of the initial popularity of an article.

This permits use of the full 5+ years of data, unlike the citation study of the previous section.

Figure 10: Fraction (in the subject area) of articles having ≥ 10 reads on a day.

Beyond the active period, typical articles receive no downloads on most days. In astro-ph,
for example, an average article is downloaded at least once during 12% of the days of its
lifetime. For hep-th and hep-ph, this number is 13% and 17% respectively, with a standard
deviation of about 10%. Readership can therefore be characterized by the number of days
an article gets at least some downloads. Since the articles are of varying age in our dataset,
we compare their readership activity beyond the active period by using the fraction of days
an article gets downloaded at least once. It is natural to ask if there is a correlation between
total initial reads and later (long-term) fraction of days getting some reads. Table 3 shows
that indeed the fraction of later days getting some downloads is quite strongly correlated
with initial popularity, by two common statistical measures.
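The two readership measures used in this subsection can be sketched in a few lines. The code below assumes, hypothetically, that each article is represented by a list of daily download counts indexed from the day of announcement; the function names, the data layout, and the toy numbers are illustrative assumptions rather than the actual processing pipeline.

```python
def active_period_days(daily_reads, min_reads=10, threshold=0.01):
    """Length of the active period for a subject area.

    daily_reads[a][d] is the number of downloads of article a on day d
    (day 0 is the announcement day). The active period ends on the first
    day on which the fraction of articles with at least `min_reads`
    downloads falls below `threshold`.
    """
    n_articles = len(daily_reads)
    n_days = min(len(series) for series in daily_reads)
    for day in range(n_days):
        busy = sum(1 for series in daily_reads if series[day] >= min_reads)
        if busy / n_articles < threshold:
            return day
    return n_days

def later_read_fraction(series, active_days):
    """Fraction of days beyond the active period with at least one download."""
    later = series[active_days:]
    if not later:
        return 0.0
    return sum(1 for reads in later if reads > 0) / len(later)

# Toy data for three hypothetical articles; the threshold is raised far above
# the 1% used in the text only because the toy sample is tiny.
toy = [
    [50, 20, 12, 3, 0, 1],
    [40, 15, 9, 0, 0, 0],
    [30, 11, 2, 0, 1, 0],
]
days = active_period_days(toy, threshold=0.34)
fractions = [later_read_fraction(series, days) for series in toy]
```

Using a fraction of days rather than a count of days is what allows articles of different ages to be compared on the same footing.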



The seven day periodicity in fig. 10 results from the confluence of lower weekend readership with
announcements of articles being made only on the five weekdays. We also checked for a possible "day of the
week" bias, but found that the particular day of the week that an article is announced has no effect on the
median number of citations.

The articles that tend to have the most usage in the long-term are review articles and other pedagogical
resources such as lecture notes. Ironically, this long-term usage is frequently not reflected in the citation
record. These articles constitute a small enough fraction of the total that they do not skew the data.

The Pearson correlation coefficient is a parametric statistic computed directly using the values. The
Spearman correlation coefficient is the nonparametric version of Pearson, replacing the values with their ranks
in sorted order. Correlation coefficients range from −1 to +1, where +1 indicates a linear correlation,
0 no correlation, and −1 linear anti-correlation. A value of 0.5 or more is ordinarily considered high.
            astro-ph    hep-th    hep-ph
Pearson     0.5861      0.7436    0.6625
Spearman    0.6716      0.7525    0.6750

Table 3: Correlation (P = 0) between the number of downloads in the active period with
the fraction of days, beyond the active period, an article is downloaded at least once.
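For readers who want to reproduce the kind of numbers reported in Table 3, the following is a minimal pure-Python sketch of the two statistics: the Pearson coefficient computed on the raw values, and the Spearman coefficient computed as the Pearson coefficient of the ranks (ties receive averaged ranks). The function names and the sample values are assumptions made for illustration only.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the midranks."""
    return pearson(midranks(x), midranks(y))

# Hypothetical example: a monotone but nonlinear relation between
# active-period reads and some later readership measure.
active_reads = [1, 2, 3, 4, 5]
later_measure = [1, 4, 9, 16, 25]
r_pearson = pearson(active_reads, later_measure)
r_spearman = spearman(active_reads, later_measure)
```

Because the rank version only depends on ordering, it is insensitive to the few very large values typical of heavy-tailed download counts, which is why it is the more natural choice for this kind of data.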



3.3 Positional Effects 

We now examine the relation between article position on the day of announcement and the 
total number of downloads in the initial active period. 




Figure 11: (a) Box plot of total astro-ph downloads in the active period for different 
positions. Each box extends from the first through the third quartile, and the red line 
marks the median. The vertical dashed lines extend above and below to the largest and 
smallest values within 1.5 times the interquartile range from the respective quartile. The 
red '+' signs represent "outlier" points above this range. (b) Median total reads for each
position, with the red bars isolating the SP effect and the green bars the V effect. 
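The quantities drawn in a box plot of this kind can be computed as in the following sketch. It uses the standard library function `statistics.quantiles` with the "inclusive" interpolation method; the quartile convention behind the published figure is not stated, so that choice, along with the function name and the sample values, is an assumption.

```python
import statistics

def box_plot_stats(values):
    """Quartiles, whisker ends, and outliers for a box plot of `values`."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers extend to the most extreme data points still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical download totals for the articles announced in one position.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

For heavy-tailed download counts most flagged points lie above the upper whisker, which matches the caption's mention of outliers above the range only.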



For astro-ph, we see from fig. 11b that the number of downloads is higher for the top
positions, and the median number declines with position for the early positions. The difference
between the first and the second positions is quite striking. The differences in medians for the
first six positions are statistically significant at the 1% level. The stochastic dominance of the
distributions for different positions is also significant. Position 1 receives roughly twice the
median number of initial reads as positions 10-40, indicating a very strong positional effect.
Fig. 11b shows that the positional effects in readership are dominated by self-promotion: the
difference between the red and green bars is significant at the 1% level for each of the first
5 positions. Comparing the green bars, representing "not early" submissions, with those of
the corresponding figure for citations in the previous section, we also see that the visibility
bias is much stronger in the initial popularity of an
article than in its long term citations, especially for the first position: the green bars in
fig. 11b show a significant drop from the first position to the next four.

For comparison with the larger bins mentioned in a footnote in subsection 2.3, astro-ph articles
in positions 1-6 received a median of 105 downloads, 44% higher than the median of 73 for those in positions
10-40. NE astro-ph articles in positions 1-6 received a median of 88 downloads, still 20% higher than for
those in positions 10-40.

Figure 12: (a) Box plot of total hep-th downloads in the active period for different positions,
as in fig. 11a for astro-ph. (b) Median hep-th reads for each position, as in fig. 11b for astro-ph.

Figure 13: (a) Box plot of total hep-ph downloads in the active period for different positions,
as in fig. 11a for astro-ph and fig. 12a for hep-th. (b) Median hep-ph reads for each position,
as in fig. 11b for astro-ph and fig. 12b for hep-th.
For the relation between article position and initial downloads for hep-th and hep-ph,
figs. 12 and 13 show a strong initial download advantage for the first two positions, and the
green bars in figs. 12b and 13b indicate a strong visibility effect for them. We confirm that the
visibility effect can play a strong role in the number of early reads even in the smaller hep-th
and hep-ph announcements.

As pointed out earlier, article position is a one-time artifact of the initial announcement, 
persisting only for a single day. It is very difficult, if not impossible, to imagine any positional 
effect on citations in the absence of even stronger positional effects on initial reads. The above 
initial readership data for astro-ph, hep-th, and hep-ph provide a consistent underpinning 
for the citation results of the previous section, and are certainly consistent with some form 
of causal relationship. 



3.4 Correlating Citation with Readership Features 

The download data was also analyzed to discover the extent to which article readership 
predicts citations, and in principle gives some initial measure of article quality. Obvious 
features that could potentially be correlated with citations are the total downloads, total 
downloads in the active period, and total number of days getting some downloads. Articles 
whose initial active period is much shorter than average (e.g., 3 days rather than 10) do 
tend to get somewhat fewer citations in the long run, as would be expected for lower quality 
articles, rapidly identified as such by discerning readers. In astro-ph, for example, roughly 
2.5% of the articles have 95% or more of their initial active period downloads during the first 
3 days. These receive a median of 4 citations, whereas the remaining articles have a median 
of 7 citations, a difference statistically significant at the 1% level. The fraction of active 
period downloads occurring in the first 3 days could thus be another predictive feature. 
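The "fast decay" feature described above can be computed directly from an article's daily download counts. The sketch below assumes a list of daily counts starting on the announcement day; the function name, the default window lengths, and the sample series are illustrative assumptions.

```python
def early_share(daily_reads, active_days=10, first_days=3):
    """Fraction of active-period downloads that occur in the first `first_days` days."""
    active = daily_reads[:active_days]
    total = sum(active)
    if total == 0:
        return 0.0
    return sum(active[:first_days]) / total

# Hypothetical article whose readership collapses after three days.
series = [40, 30, 25, 3, 1, 1, 0, 0, 0, 0, 0, 2]
share = early_share(series)      # uses only the first 10 days
loses_readers_fast = share >= 0.95
```

An article is flagged when at least 95% of its active-period downloads fall in the first three days, mirroring the criterion used in the text for astro-ph.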

It has been observed [Stewart, 1983; Baldi, 1998; van Dalen & Henkins, 2001] that the
number of citations is positively correlated with the number of authors of an article. Since 
articles accumulate citations with time, their age will have some correlation with the number 
of citations. As discussed earlier, self-promoted early articles receive more citations, perhaps 
due to higher quality, and position in the mailing may result in a visibility effect: thus 
whether or not an article is early and its position are important features. 





            E        P        A       AR       F       D       TR      AG
astro-ph    0.113    -0.087   0.25    0.2753   0.069   0.326   0.328   0.086
hep-th      0.07     0.013    0.256   0.4825   0.25    0.61    0.593   0.07
hep-ph      0.092    -0.02    0.27    0.41     0.212   0.642   0.61    0.08

Table 4: Spearman rank correlation between the number of citations with different features:
early or not (E), position in mailing (P), number of authors (A), reads in the active period
(AR), fraction of active period reads outside the first 3-5 days (F), number of days beyond
the active period getting some reads (D), total reads during lifetime (TR), age in days (AG).

These features are all correlated in some way with the number of citations. We use the
citation and readership data for papers submitted between Jan 2002 and Dec 2004 for our
analysis. Table 4 shows the rank correlation between number of citations and the different
features individually. The feature most correlated with the ultimate number of citations is
the number of days beyond the active period an article gets some downloads. Steady reads
beyond the initial period are thus most predictive of citations, although initial reads are
useful as well. Reads in this case can even be a consequence of the citations, since citations
can lead readers directly to the arXiv site, hence the correlation. The total number of
reads is also well correlated with citations.
            E    P     A    AR     F          D      TR     AG
astro-ph    0    16    3    69     0.15556    96     185    1598
hep-th      0    7     2    137    0.17966    162    346    1636
hep-ph      0    9     2    79     0.16949    118    228    1630

Table 5: Medians of the quantities in table 4.



For completeness, in table 5 we give the medians of the quantities in table 4, but emphasize
that the details of the distribution reflected in the rank correlation are not well captured
by an aggregate quantity like the median. In addition, many of the medians are intrinsically
unilluminating. For example, whether or not an article is early (E) is a binary feature, and
the median is 0 since more than half the articles are not early. The median position (P) will
be very close to half of the average mailing length since each of the positions has the same
number of articles up to that length. The median age (AG) is constrained to be roughly 5
years for articles that range from 3.5-6.5 years old, and the median reads in the active period
(AR) have already been given in figs. 11-13. Apart from the small fraction of articles that
lose readership very quickly, the distribution of the fraction of active period reads after the
first 3-5 days (F) will not differ substantially from the overall pattern of exponential falloff
in readership.

Is there a meaningful way to harness the combined predictive capacity of the above
features? The next logical step beyond correlation is to use regression. In addition to the
above features, we have used the daily number of downloads for each of the first 100 days, since
the initial period is of much interest, and used the Support Vector Machine implementation
SVMlight [Joachims, 1999], a modern supervised machine learning tool (see the Appendix), to
predict citations. The methodology involved normalizing every feature by the 95th percentile
of its set of values, to avoid convergence problems due to features having values that differ
by several orders of magnitude. Since features like the initial and total reads have heavy-tailed
distributions, norms like the 1-norm, 2-norm or the ∞-norm would be dominated by the
few large values, and hence normalization by any of these norms would result in setting the
small values effectively to zero.
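The effect of the normalization choice can be illustrated with a short sketch. The percentile below is computed by linear interpolation between order statistics, which is one common convention; the text does not say which convention was used, so that detail, the function names, and the toy values are assumptions.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def normalize_by_percentile(values, q=95):
    """Divide every value by the q-th percentile of the whole set."""
    scale = percentile(values, q)
    if scale == 0:
        return [float(v) for v in values]
    return [v / scale for v in values]

# Hypothetical heavy-tailed feature: 99 ordinary values and one huge outlier.
reads = list(range(1, 100)) + [1000000]
by_percentile = normalize_by_percentile(reads)
by_max = [v / max(reads) for v in reads]   # infinity-norm normalization

typical_percentile_scaled = by_percentile[49]   # the value 50 after scaling
typical_max_scaled = by_max[49]                 # the same value, max-scaled
```

With max-based scaling the typical value is pushed to within a few parts in 100,000 of zero, whereas scaling by the 95th percentile leaves it of order one, which is the behavior the text is after.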

Whether citation or use of a bibliographic database leads readers to a journal site after publication or
still to the arXiv site depends on how an article is cited, and also on the readership habits of the community,
which could differ between the high energy physicists and the astrophysicists. Even the initial period in
astro-ph is more likely to share readership with a journal version, since astrophysicists occasionally make
arXiv submissions simultaneous with journal acceptance, while high energy physicists tend to make arXiv
submissions hot out of the word-processor.

http://svmlight.joachims.org/
After normalization, the data set was randomly split into five equal parts. Then we ran
SVMlight in its regression mode [Smola & Schölkopf, 2004] (with the default linear kernel) five
times, using in turn each of the five parts as test set, and the remaining 80% as training set 
in each case. This is the standard 5-fold cross-validation procedure to ensure no overfitting of 
the data. For every run, the predicted citations were compared against the true citations to 
compute the predictive accuracy. Once again, since citations follow power law distributions, 
it is preferable to compare the ranking of the articles by the predicted citations and the 
ranking produced by the true citations, rather than comparing the actual magnitudes. This 
was done by computing the Spearman rank correlation coefficient between the predicted 
citations and the true citations, with the numbers then averaged for the 5 runs. 
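The evaluation loop described above (random 5-fold split, fit on 80%, rank-correlate predictions with true citations on the held-out 20%, then average) can be sketched as follows. To keep the example self-contained, an ordinary least-squares line on a single feature is swapped in for the SVM regression; that substitution, the function names, and the synthetic data are all assumptions made for illustration and do not reproduce the SVMlight setup.

```python
import random

def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    return pearson(midranks(x), midranks(y))

def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for the SVM regression)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def cross_validated_spearman(feature, citations, n_folds=5, seed=0):
    """Average held-out Spearman correlation over an n-fold random split."""
    indices = list(range(len(feature)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f in range(n_folds) if f != held_out for i in folds[f]]
        slope, intercept = fit_line([feature[i] for i in train],
                                    [citations[i] for i in train])
        predicted = [slope * feature[i] + intercept for i in test]
        actual = [citations[i] for i in test]
        scores.append(spearman(predicted, actual))
    return sum(scores) / n_folds, scores

# Synthetic, perfectly linear data: every fold should rank the test set exactly.
feature = list(range(50))
citations = [2 * x + 1 for x in feature]
mean_score, fold_scores = cross_validated_spearman(feature, citations)
```

Scoring by rank correlation rather than by squared error is what makes the comparison robust to the few very highly cited articles.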





                      astro-ph    hep-th    hep-ph
Average               0.3930      0.5998    0.6326
Standard Deviation    0.0211      0.0074    0.0168

Table 6: Spearman rank correlation coefficient between the actual citations and the
citations predicted by the SVM regression.



Table 6 shows the extent to which regression was successful in ranking the articles.
For hep-th and hep-ph the correlation is indeed quite high. For astro-ph the correlation 
is smaller, but still substantial. One possible explanation for this smaller correlation is 
that astro-ph citations more frequently lead to readership of the journal version, and are 
not captured by arXiv readership data as well as are citations in the hep-th and hep-ph 
literatures, whose readers are by habit more likely to consult the version resident on the 
arXiv server. To assess this possibility, we folded in data, kindly provided by ADS, giving 
the number of full text downloads directed to the publishers via ADS (rather than to arXiv). 
This number is strongly correlated with the number of citations (roughly Spearman 0.5 for 
articles eventually published in a journal). Used as an additional feature in our SVM setting, 
the rank correlation in table 6 shifts to 0.7 for astro-ph, now comparable to and even slightly
higher than the hep-th and hep-ph correlations. 





                      astro-ph    hep-th    hep-ph
Average               0.3869      0.577     0.5812
Standard Deviation    0.0075      0.0214    0.0200

Table 7: Spearman rank correlation coefficient between the actual citations and the
citations predicted by the SVM regression, but without using the total reads and the long-term
fraction of days receiving downloads.



As noted earlier, reads beyond the initial period, characterized both by the number of 
days beyond the active period when a paper gets some reads, and by the total number of 
reads, are most strongly correlated with citations. These two features are not necessarily 
predictive, however, since later reads at the arXiv site can result in future citations but 
can also result from citations, either directly or indirectly due to increased interest in an 
article. Table 7 shows the results of removing these two features and running the SVM
regression again with 5-fold cross-validation. The correlation weakens slightly, as would be
expected, but the early number of reads remains highly predictive of the long term citation
behavior.



3.5 Discussion 

There is no direct analog in other on-line resources for the positional effects on readership 
considered in sec. 3.3, both due to the nature of the arXiv daily announcements and the
central notification role arXiv plays for entire research communities. We've seen that vis- 
ibility plays a strong unintentional role as a recommender. The readership effects of the 
top few positions can be understood in terms of a stochastic decay-of-attention model, in 
which there is some probability of distraction at each entry, either by pausing to read the 
associated article full text, or by some external event. The reader either never returns to 
the original window to read the rest of the list, or having already spent time looking at full 
text becomes less likely to retrieve later full texts for perusal. The difficulty in eliminating 
such effects provides an additional rationale for offering personalization services to readers: 
when different readers view customized announcements ordered according to their individual 
preferences, the artifactual visibility biases of a single global list no longer play a dominant 
resonant role for the full research community. 
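One simple reading of the decay-of-attention picture is sketched below. It assumes, hypothetically, that every reader scans the announcement from the top, downloads each entry they reach with a fixed probability, and abandons the list after each entry with another fixed probability. Neither probability is estimated in the text, and the function name and parameter values are illustrative only.

```python
def expected_reads_by_position(n_readers, n_positions, p_read, p_leave):
    """Expected full-text downloads at each list position (index 0 is position 1)."""
    reads = []
    still_scanning = float(n_readers)
    for _ in range(n_positions):
        reads.append(still_scanning * p_read)
        # A fraction p_leave of the remaining readers never reaches the next entry.
        still_scanning *= (1.0 - p_leave)
    return reads

# Hypothetical parameters: 1000 readers, each entry read with probability 0.1,
# and a 5% chance of abandoning the list after each entry.
profile = expected_reads_by_position(1000, 40, 0.1, 0.05)
ratio_first_to_twentieth = profile[0] / profile[19]
```

In this toy version the expected reads fall off geometrically with position, so even a small per-entry abandonment probability yields a large advantage for the first position in a list of several dozen entries.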

The overall correlation we have found between citation and various readership features
in table 4 confirms in a modern electronic context the primary intellectual role played by
citation. Rather than playing some symbolic or primarily social role, or being thoughtlessly
propagated without consultation of sources, citations both appear clearly as a consequence of
readership, and lead to further readership. The relation found here between readership
and later citations amplifies the results of previous studies [Perneger, 2004; Moed, 2005;
Brody et al., 2006; Neurosci Editor, 2008] on the highly predictive role played by early
readership. It is thus tempting to try to incorporate early readership and other newly available
measures of popularity, such as blog commentary or other 'Web 2.0' commentary
mechanisms, into some form of early guide to readers; and later in an article's lifetime into
some more generalized impact metric, incorporating citations as well. On the other hand,
we've documented here that accidental forms of visibility can drive early readership, with
consequent early citation potentially initiating a feedback loop to more readership and
citation, ultimately leaving measurable and significant traces in the citation record. Thus while
citations are not primarily used for social purposes, they may nonetheless be subject to
indirect influences familiar from studies of social networking effects [Salganik et al., 2006], and
thereby not provide an impact metric with the desired objectivity.

Other early activity measures correlated with long-term popularity have recently been
considered for on-line sites such as YouTube and Digg [Szabo & Huberman, 2008], where the
effect of early feedback mechanisms is found to be even more pronounced. There are many
areas of superficial similarity between on-line scholarship sites on the one hand, and news and
commerce sites on the other, but in the context of the results presented here it is important
to recall their very different motivations for recommender mechanisms. An on-line news site
that draws the attention of readers to popular articles increases the number of article reads
and hence chances to bring advertisements in front of readers. An on-line commerce site
that successfully recommends other popular items increases its number of products ordered
and gross revenues. By contrast, a scholarly site that focuses attention on a smaller number
of articles, either intentionally or otherwise, could do an inadvertent disservice to both its
authors and readers.

Another highly predictive feature that we did not analyze in detail here is the time to the first citation.
A Power Law Fitting

To fit data to a power law, often the method of maximum likelihood estimation is used to 
compute the power law exponent, followed by a least squares fit of a straight line with the 
computed slope in a log-log plot.

For a power law distribution with $p(x) \propto x^{-\alpha}$, the cumulative distribution
$F(X > x) \propto x^{-(\alpha-1)}$ is also a power law. The cumulative distribution $F(X > x)$ is smoother so is
customarily used for power law fitting, and is often plotted as a Rank-Frequency (RF) plot
[Newman, 2005]. Swapping the axes of an RF plot gives the Zipf plot, which follows a power
law behavior with the inverse of the RF exponent.



[Dietrich, 2008a] gives the Zipf plots of citations for the top 10 positions and the remaining
positions, binned appropriately. These curves give the cumulative distribution function
$F(X > x)$ for different positions, and are useful in comparing two distributions for stochastic
dominance. A cumulative distribution $F(X > x)$ is said to stochastically dominate (first
order) [Bawa, 1975] another cumulative distribution $G(X > x)$ iff for all $x$ we have

$$F(X > x) \geq G(X > x) .$$

In risk analysis, it is always safer to gamble according to the dominating distribution, since 
it is expected to produce higher values. If one RF curve is always above another, then there 
is stochastic dominance, although the statistical significance of the dominance needs to be 
verified. 
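As a rough illustration of the "one curve always above another" criterion, the sketch below compares two empirical distributions $P(X > x)$ at every observed value. The function names are hypothetical, and the check is purely descriptive: it says nothing about statistical significance, which still requires a separate test.

```python
from bisect import bisect_right


def survival(sorted_sample, x):
    """Empirical P(X > x) for a sample that is already sorted ascending."""
    n = len(sorted_sample)
    return (n - bisect_right(sorted_sample, x)) / n


def dominates(first, second):
    """True if the empirical P(X > x) of `first` is >= that of `second` for all x.

    Both curves are step functions that change only at observed values,
    so comparing them on the pooled set of observed values is sufficient.
    """
    a, b = sorted(first), sorted(second)
    grid = sorted(set(a) | set(b))
    return all(survival(a, x) >= survival(b, x) for x in grid)
```

Two samples whose curves cross, for example `[1, 10]` and `[4, 5]`, dominate in neither direction.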

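In [Dietrich, 2008a], the citation distribution of the top position is found to be higher
than that of the lower positions. The power law exponent of the Zipf plot in [Dietrich, 2008a]
is $\beta = 0.48$, in accord with [Redner, 1998]. The power law exponent of the citation
distribution is thus $\alpha = 1 + 1/\beta = 3.0833$. At this value of $\alpha$, the mean
[Newman, 2005] citation is

$$\langle x \rangle = \frac{\alpha - 1}{\alpha - 2}\, x_{\min} .$$

If there is an upper limit, $x_{\max}$, then the mean becomes

$$\langle x \rangle = \frac{\alpha - 1}{\alpha - 2}\, x_{\min}\,
\frac{1 - (x_{\min}/x_{\max})^{\alpha - 2}}{1 - (x_{\min}/x_{\max})^{\alpha - 1}} .$$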
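For reference, a short derivation of the mean with an upper limit, assuming a continuous density $p(x) = A\,x^{-\alpha}$ supported on $[x_{\min}, x_{\max}]$ with $\alpha > 2$. The normalization constant $A$ is notation introduced here for the sketch, and the final form is one of several equivalent ways of writing the result.

```latex
\begin{align}
1 &= \int_{x_{\min}}^{x_{\max}} A\, x^{-\alpha}\, dx
   \;\Rightarrow\;
   A = \frac{\alpha - 1}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}} , \\
\langle x \rangle &= \int_{x_{\min}}^{x_{\max}} A\, x^{1-\alpha}\, dx
   = \frac{\alpha - 1}{\alpha - 2}\,
     \frac{x_{\min}^{2-\alpha} - x_{\max}^{2-\alpha}}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}}
   = \frac{\alpha - 1}{\alpha - 2}\, x_{\min}\,
     \frac{1 - (x_{\min}/x_{\max})^{\alpha-2}}{1 - (x_{\min}/x_{\max})^{\alpha-1}} .
\end{align}
```

As $x_{\max} \to \infty$ the last fraction tends to 1 and the untruncated mean $\frac{\alpha-1}{\alpha-2}\,x_{\min}$ is recovered.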
To restrict to the region where the power law is valid, the small and large rank regions
of the Zipf plots are excluded in [Dietrich, 2008a], introducing (a) a normalization bias, and
(b) a potential bias of eliminating a large fraction of the data, as we now describe:

• (a) Given two curves, say citations corresponding to position 1, and to positions 10-40,
the restriction to the power law region introduces cutoffs $x^{1}_{\min} > x^{10\text{-}40}_{\min}$
and $x^{1}_{\max} > x^{10\text{-}40}_{\max}$, where

$$\log x^{1}_{\min} - \log x^{10\text{-}40}_{\min} = \log x^{1}_{\max} - \log x^{10\text{-}40}_{\max}$$

(since log-log plots of two parallel straight lines are equidistant at the endpoints). This gives

$$\frac{\langle x^{1} \rangle}{\langle x^{10\text{-}40} \rangle}
= \frac{x^{1}_{\min}}{x^{10\text{-}40}_{\min}}
= \frac{x^{1}_{\max}}{x^{10\text{-}40}_{\max}} ,$$

because the ratio $x_{\min}/x_{\max}$ is the same for both curves, so every factor in the mean
other than $x_{\min}$ cancels. The cut-off in [Dietrich, 2008a] was such that
$x^{1}_{\min} / x^{10\text{-}40}_{\min} \approx 2$, so it is not clear whether the factor
of 2 advantage in the average was due to the cut-off having given $\langle x^{1} \rangle$ the
benefit of a higher $x_{\min}$.

• (b) Our analysis of the same data gives a median citation for position 1 of 10, and
for positions 10-40 of 4. A large lower cutoff will thus ignore a large fraction of the data.
Ref. [Dietrich, 2008a] used $x^{1}_{\min} \approx 50$, whereas the 75th percentile of the
citations for position 1 is 22. This means at least 3/4 of the data was ignored to compute
the aggregate values.
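Both biases can be checked numerically. The sketch below assumes a continuous power law truncated to $[x_{\min}, x_{\max}]$; the function names are hypothetical and any cutoff values passed in are illustrative rather than taken from either analysis.

```python
def truncated_power_law_mean(alpha, x_min, x_max):
    """Mean of a continuous power law p(x) ~ x**(-alpha) restricted to [x_min, x_max].

    Assumes alpha > 2 and 0 < x_min < x_max.
    """
    r = x_min / x_max
    shape = (1.0 - r ** (alpha - 2.0)) / (1.0 - r ** (alpha - 1.0))
    return (alpha - 1.0) / (alpha - 2.0) * x_min * shape


def fraction_below(values, cutoff):
    """Fraction of observations discarded by a lower cutoff (those strictly below it)."""
    return sum(1 for v in values if v < cutoff) / len(values)
```

For example, with `alpha = 1 + 1 / 0.48`, the call `truncated_power_law_mean(alpha, 50.0, 400.0) / truncated_power_law_mean(alpha, 25.0, 200.0)` returns the cutoff ratio 2 (up to rounding), whatever the common exponent is; the upper limits 400 and 200 are made-up values chosen only so that both cutoffs scale by the same factor. Applying `fraction_below` to the raw citation counts of a position shows how much of the data a given lower cutoff removes.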



B Statistical Significance 

To test the statistical significance of the difference in median citations, we use the
Mann-Whitney U (MWU) test, with the null hypothesis that the medians are equal, and the
two-sided alternative that the medians are not equal, at the 1% significance level. Table 8
shows that for astro-ph the medians of the top 5 positions are significantly different from
the medians of the positions 10 and beyond.
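A minimal sketch of a two-sided Mann-Whitney U test is given below, assuming the large-sample normal approximation with a tie correction and a continuity correction. It is not necessarily the implementation used for Table 8, and the function name is hypothetical; for small samples an exact test would be preferable.

```python
import math


def mann_whitney_u(first, second):
    """Return (U statistic of `first`, two-sided p-value) via the normal approximation."""
    n1, n2 = len(first), len(second)
    pooled = sorted([(v, 0) for v in first] + [(v, 1) for v in second])
    rank_sum_first = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        rank_sum_first += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j
    u_first = rank_sum_first - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0.0:
        return u_first, 1.0  # all observations identical: no evidence of a difference
    z = max(0.0, abs(u_first - mean_u) - 0.5) / math.sqrt(var_u)
    return u_first, math.erfc(z / math.sqrt(2.0))
```

A pair of positions would be reported as significantly different when the returned p-value falls below 0.01.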

A significant difference in median does not necessarily mean a distribution is better at all
levels. To test stochastic domination, we used the Kolmogorov-Smirnov (KS) test with the
null hypothesis that the two distributions are the same, and the one-sided alternative that
the first distribution dominates the second, at the 1% significance level. Table 9 shows that
for astro-ph the first 5 positions are indeed better than all other positions, for all values.
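The one-sided test can be sketched as below, assuming the standard asymptotic approximation $p \approx \exp(-2 D_{+}^{2}\, nm/(n+m))$ for the one-sided two-sample KS statistic. The function name is hypothetical, and the approximation is only reliable for reasonably large samples.

```python
import math
from bisect import bisect_right


def ks_one_sided(first, second):
    """Return (D_plus, approximate p-value) for the alternative that `first` dominates.

    D_plus is the largest amount by which the empirical CDF of `second` exceeds
    that of `first`; it is large when `first` tends to take larger values.
    """
    a, b = sorted(first), sorted(second)
    n, m = len(a), len(b)
    d_plus = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / n
        cdf_b = bisect_right(b, x) / m
        d_plus = max(d_plus, cdf_b - cdf_a)
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value
```

Swapping the two arguments tests domination in the opposite direction, so the two calls together distinguish "first dominates", "second dominates", and "neither".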



Position from    Position onwards
1                2
2                5
3                5
4                7
5                11
6                11

Table 8: Mann-Whitney U test for astro-ph. The left column is the position whose median we
are assessing for significant difference (1% significance level) with a two-sided alternative
(either median the greater). The right column is the position whose median (and that
of positions beyond) is significantly different from that of the corresponding position in
the left column. For example, the median number of citations for position 2 is greater than
that of positions 5 and beyond, at the 1% significance level.

Position from    Position onwards
1                4
2                5
3                6
4                7
5                11

Table 9: Kolmogorov-Smirnov test for astro-ph. The left column is the position whose
distribution we are assessing for stochastic domination (1% significance level) with a
one-sided alternative. The right column is the position whose distribution (and those of the
positions beyond) is stochastically dominated by the corresponding position in the left
column. For example, the number of citations for position 2 is greater than that of
positions 5 and beyond, at all levels, at the 1% significance level.



C SVM Regression

SVM regression [Smola & Schölkopf, 2004] differs from the standard regression task in
two ways. First, SVM uses the $\epsilon$-insensitive loss function, where for individual sample
points only an error greater than $\epsilon$ counts as "error", and the total error is the sum
of the samplewise errors. Second, the function to be minimized is a combination of the
$\epsilon$-insensitive loss function and the squared norm of the vector of regression
coefficients. The tradeoff between this norm and the loss function is controlled by a
parameter $C$. The algorithm takes both $\epsilon$ and $C$ as parameters, and setting small
$\epsilon$ and large $C$ gives a form of least squares result. SVM regression uses state of
the art constrained optimization techniques to find a solution. Its real power, however, is
the ease with which nonlinearity can be incorporated by higher order kernels. The efficiency
and accuracy of this approach have already been firmly established in the realm of machine
learning through numerous principled applications.
To explore the predictive capacity of readership and other features, we treated it as a
standard supervised prediction task in machine learning. Some past attempts to correlate
citations with article and author features [Stewart, 1983, Baldi, 1998, van Dalen & Henkins, 2001]
used samples that were several orders of magnitude smaller and hence allowed manual
extraction of features.* In such a setting regression is used for the entire dataset and the
total error is reported. A potential problem with this approach is that it may simply validate
the regression model used, rather than result in learning and prediction. Use of the full
dataset may also be vulnerable to overfitting through extraneous features. In machine learning,
the standard approach is to cross-validate through random training and test splits of the
data, and report the average accuracy on the test sets. This puts less emphasis on human
verification of the model being learned, especially when higher order kernels are used.

* While manual extraction of features is not as feasible on the larger datasets currently in use, modern
text-mining tools together with the increased availability of the full-texts in digital form should ultimately
permit automated extraction of a comparable set of features.
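The cross-validation procedure can be sketched as follows. This is a minimal illustration, assuming repeated random train/test splits and mean absolute error as the reported score; the helper names, the 20% test fraction, and the simple one-feature line fit used as a stand-in model are all assumptions of the example rather than details of the experiments.

```python
import random


def random_split(n_items, test_fraction, rng):
    """Shuffle item indices and return (train_indices, test_indices)."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_test = max(1, int(round(test_fraction * n_items)))
    return indices[n_test:], indices[:n_test]


def cross_validate(features, targets, fit, predict, n_splits=10, test_fraction=0.2, seed=0):
    """Average the mean absolute test error over several random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train, test = random_split(len(targets), test_fraction, rng)
        model = fit([features[i] for i in train], [targets[i] for i in train])
        errors = [abs(targets[i] - predict(model, features[i])) for i in test]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


def fit_line(xs, ys):
    """Stand-in model: ordinary least squares line through one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept
```

Because the model is always scored on articles it was not fitted to, a low error reflects predictive power rather than a good fit to the full dataset; any regression method exposing a `fit`/`predict` pair can be substituted for the line fit.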